Song Analysis Project
Introduction
Based on what we discussed in our initial consulting meeting, our understanding is that you have two goals for this data analysis. The first and primary goal is to recommend songs for users of your music software to listen to next, based on the characteristics (danceability, acousticness, etc.) of the song they are currently listening to: essentially a song recommendation algorithm built on previously listened-to songs. The secondary goal is to predict the genre of a song from the same set of song characteristics.
It is our understanding that the desired outcome of this project is to improve the ratings and user experience of your music software. The improved recommendation algorithm will help achieve this goal and will be delivered as a front-end application (Shiny app) along with a supplementary report outlining the steps of the analysis. Given this, the target audience is all users of your software. The Shiny app will let a user select a song and then return a list of recommended songs, ordered by similarity to the selected song.
The data you have provided come in the form of two separate datasets, one for the primary goal and one for the secondary goal.
The first dataset contains roughly 130,000 songs available on your music software. Each row of the dataset is a single song accompanied by a variety of variables (acousticness, danceability, energy, etc.); a subset of these variables will be used to cluster the songs into groups, and the song recommendations will be based on these clusters. For example, if a user of the app inputs a song, the list of recommended songs will be drawn from the input song's cluster.
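To make the lookup concrete, here is a minimal R sketch of the cluster-based recommendation idea; `clusters`, `track_name`, and `cluster` are assumed names standing in for the eventual clustering output:

```r
# Sketch of cluster-based recommendation (data frame and column names are assumptions).
# `clusters`: one row per song, with its assigned cluster label.
recommend_songs <- function(clusters, song_name, n = 10) {
  target <- clusters$cluster[clusters$track_name == song_name][1]
  if (is.na(target)) return(character(0))    # song not in the catalog
  candidates <- clusters$track_name[clusters$cluster == target]
  head(setdiff(candidates, song_name), n)    # other songs in the same cluster
}

clusters <- data.frame(
  track_name = c("Song A", "Song B", "Song C", "Song D"),
  cluster    = c(1, 1, 2, 1)
)
recommend_songs(clusters, "Song A", n = 2)   # returns "Song B" "Song D"
```

In the final app the candidates would additionally be ordered by similarity to the input song, rather than returned in dataset order as here.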
Here is the link to the data described above: https://www.kaggle.com/datasets/tomigelo/spotify-audio-features?select=SpotifyAudioFeaturesApril2019.csv
The second dataset is a different subset of songs from your music software, this time 114,000 songs. The difference from the first dataset is that this one includes the track genre as a column; the remaining columns contain the same song characteristics as before. This dataset will be used to train a model that predicts the genre of a song from those characteristics.
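As a purely illustrative sketch of this classification task (a simple nearest-centroid rule, not necessarily the model we will ultimately fit): learn per-genre feature averages from labelled songs, then assign a new song to the genre with the closest centroid.

```r
# Nearest-centroid genre classifier (illustrative stand-in, not the final model).
fit_centroids <- function(features, genre) {
  lapply(split(features, genre), colMeans)   # one vector of feature means per genre
}
predict_genre <- function(centroids, x) {
  dists <- sapply(centroids, function(mu) sqrt(sum((x - mu)^2)))
  names(which.min(dists))                    # genre with the closest centroid
}

# Toy data: two genres separated on two synthetic features
train <- data.frame(energy = c(0.9, 0.8, 0.1, 0.2), tempo = c(150, 140, 70, 80))
genre <- c("rock", "rock", "ambient", "ambient")
centroids <- fit_centroids(train, genre)
predict_genre(centroids, c(energy = 0.85, tempo = 145))   # returns "rock"
```

The real model will be trained on the full set of song characteristics and evaluated properly; this only shows the shape of the input/output.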
Here is the link to the data described above: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset/data
Regarding the ethics of this analysis, one major issue is the scope of the dataset used for the recommendation algorithm. Not all artists are contained in the dataset, so artists absent from it can never be recommended to users. Depending on how the songs were chosen, some artists could be systematically excluded, meaning not all artists are given a fair shot. Smaller artists will likely have fewer songs in the dataset and therefore a lower chance of being recommended, while more popular artists will likely have more songs and be recommended more often. In its current state, the algorithm may therefore reinforce the status quo of the music industry: keeping popular artists popular while making it harder for less popular artists to break through.
Previous Work
Articles:
- https://medium.com/neemz-product/the-inner-workings-of-spotifys-ai-powered-music-recommendations-how-spotify-shapes-your-playlist-a10a9148ee8d
- https://www.hopkinsmedicine.org/health/wellness-and-prevention/keep-your-brain-young-with-music
- https://www.northshore.org/healthy-you/9-health-benefits-of-music/#:~:text=Music%20can%20boost%20the%20brain’s,involved%20in%20mood%20and%20emotions.
Summary of the first article:
Spotify’s rise to the top of the music streaming world is largely due to its ingenious use of AI to create personalized music experiences for its massive user base of over 500 million monthly listeners. The platform’s success hinges on its ability to understand user preferences and behaviors through data analysis and collaborative filtering. This process connects users with music that aligns with their tastes, creating playlists that feel custom-made. Spotify dives deep into the nuances of each track, considering aspects like beats, lyrics, and cultural context, to make finely-tuned recommendations.
As Spotify continues to evolve, it faces challenges like the cold start problem with new artists and potential biases in recommendations. To address these, it incorporates human expertise alongside its algorithms and continuously adjusts its system to ensure fairness and diversity. Looking to the future, Spotify is exploring innovative concepts like a virtual DJ persona, enhancing user interaction and making music discovery more engaging and personal. This blend of advanced technology and human insights sets Spotify apart in curating unique and resonant musical journeys for each user.
Summary of the second article:
This article explores how our brains process and perceive music. When we listen to music, sound vibrations travel through the air, enter the ear canal, and vibrate the eardrum. These vibrations are then converted into electrical signals that travel along the auditory nerve to the brain stem. Here, these signals are reconstructed into what we recognize as music.
The article also highlights research conducted by Johns Hopkins University, where jazz musicians and rappers were asked to improvise music while inside an fMRI machine. This study aimed to observe which areas of the brain are activated during musical improvisation.
The article emphasizes that music is not just an art form but also possesses structural, mathematical, and architectural qualities. It relies on the relationship between successive notes. Unbeknownst to many, the brain performs complex computations to interpret and make sense of music, illustrating the intricate connection between music and cognitive processes.
Summary of the third article:
The article highlights the numerous health benefits of music, emphasizing its ability to positively impact mood, reduce pain and anxiety, and provide a means for emotional expression. It delves into the role of music therapy in healthcare, particularly in hospice and palliative care. The therapy, conducted by board-certified music therapists, is used to complement traditional treatments for various conditions, including anxiety, depression, stress, pain management, and improving function in degenerative neurological disorders. The article underscores the significant role that music can play in enhancing both physical and mental health.
Similarities and differences between our work and the articles:
Our project's main focus is similar to the first article: we are building a machine-learning algorithm that makes music more engaging for users by recommending songs and playlists based on the songs they have previously listened to (i.e., recommendations tailored to each user's taste). We will also deploy a user-interactive app (a Shiny app) that displays the recommended songs/playlists to play next.
From the second and third articles, we learned that music has the ability to positively impact mood, reduce pain and anxiety, and provide a means for emotional expression.
Since the latter two articles are not strictly "previous work", the only difference between our project and the first article is that we also predict the genre of songs from a variety of measurements.
Exploratory Analysis
Variables of Interest:
Acousticness: confidence measure on if the song is acoustic
Danceability: measure of how well suited the song is for dancing
Energy: measure of intensity and activity
Instrumentalness: measure of the level to which the song is an instrumental track
Key: the musical key of the track
Liveness: measure of the likelihood an audience was present during the song's recording
Loudness: measure of loudness in decibels
Mode: the modality (major or minor, 1 or 0)
Speechiness: measure of the quantity of spoken words
Tempo: beats per minute
Time Signature: estimated time signature
Valence: measure of positivity
Track Genre: song genre
Primary Goal Clustering Data:
track_id acousticness danceability duration_ms
Length:130663 Min. :0.0000 Min. :0.0000 Min. : 3203
Class :character 1st Qu.:0.0316 1st Qu.:0.4590 1st Qu.: 163922
Mode :character Median :0.2030 Median :0.6050 Median : 201901
Mean :0.3425 Mean :0.5815 Mean : 212633
3rd Qu.:0.6360 3rd Qu.:0.7270 3rd Qu.: 241048
Max. :0.9960 Max. :0.9960 Max. :5610020
energy instrumentalness key liveness
Min. :0.0000 Min. :0.000000 1 :15348 Min. :0.0000
1st Qu.:0.3960 1st Qu.:0.000000 0 :14972 1st Qu.:0.0975
Median :0.6030 Median :0.000149 7 :13930 Median :0.1240
Mean :0.5692 Mean :0.224018 2 :12363 Mean :0.1949
3rd Qu.:0.7750 3rd Qu.:0.440000 9 :12029 3rd Qu.:0.2360
Max. :1.0000 Max. :1.000000 5 :10816 Max. :0.9990
(Other):51205
loudness mode speechiness tempo time_signature
Min. :-60.000 0:51254 Min. :0.0000 Min. : 0.00 0: 299
1st Qu.:-11.898 1:79409 1st Qu.:0.0389 1st Qu.: 96.01 1: 1749
Median : -7.979 Median :0.0559 Median :120.03 3: 12666
Mean : -9.974 Mean :0.1120 Mean :119.47 4:112652
3rd Qu.: -5.684 3rd Qu.:0.1290 3rd Qu.:139.64 5: 3297
Max. : 1.806 Max. :0.9660 Max. :249.98
valence popularity
Min. :0.0000 Min. : 0.00
1st Qu.:0.2240 1st Qu.: 7.00
Median :0.4200 Median : 22.00
Mean :0.4396 Mean : 24.21
3rd Qu.:0.6380 3rd Qu.: 38.00
Max. :1.0000 Max. :100.00
Comment: Above is a summary of the distribution of the numerical variables in this dataset. Key, mode, and time_signature are ordinal variables, so the output displays the counts associated with each level.
# A tibble: 6 × 2
artist_name track_name
<chr> <chr>
1 YG Big Bank feat. 2 Chainz, Big Sean, Nicki Minaj
2 YG BAND DRUM (feat. A$AP Rocky)
3 R3HAB Radio Silence
4 Chris Cooq Lactose
5 Chris Cooq Same - Original mix
6 Curbo Debauchery - Original mix
Comment: Above is a quick look at how artist_name and track_name appear in this dataset. The dataset includes multiple songs from the same artist rather than one song per artist. There are too many artists in this data for a per-artist song count (i.e., a table displaying the number of songs for each artist) to be readable.
Code
Figure 1: Danceability Distribution
The distribution looks roughly symmetric with a slight left skew. Across all of the songs, most have fairly high danceability, around 0.5-0.7. Given that we are going to perform unsupervised clustering for this part, it might not be beneficial to include danceability in our models, since most songs have roughly the same danceability.
Code
audio %>%
ggplot(mapping = aes(x = acousticness)) +
geom_histogram(bins=40,fill = "pink", color = "black") +
labs(x="acousticness of a song", y="Count")+
  ggtitle("Distribution of a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic")
Figure 2: Acousticness Distribution
The distribution is fairly evenly spread across the middle range (0.25-0.75) with peaks towards both ends. This suggests that similar songs are likely to have similar acousticness; in other words, songs are not concentrated in one specific range but are spread out. Therefore, we should include acousticness in our model.
Code
Figure 3: Tempo Distribution
Most songs fall around 90-150 BPM, with some peaks towards the ends of this range. This suggests tempo might not be useful for clustering, since most songs fall within roughly the same range.
Secondary Goal Genre Prediction Data:
Code
popularity duration_ms explicit danceability
Min. : 0.00 Min. : 0 Mode :logical Min. :0.0000
1st Qu.: 17.00 1st Qu.: 174066 FALSE:104253 1st Qu.:0.4560
Median : 35.00 Median : 212906 TRUE :9747 Median :0.5800
Mean : 33.24 Mean : 228029 Mean :0.5668
3rd Qu.: 50.00 3rd Qu.: 261506 3rd Qu.:0.6950
Max. :100.00 Max. :5237295 Max. :0.9850
energy key loudness mode
Min. :0.0000 Min. : 0.000 Min. :-49.531 Min. :0.0000
1st Qu.:0.4720 1st Qu.: 2.000 1st Qu.:-10.013 1st Qu.:0.0000
Median :0.6850 Median : 5.000 Median : -7.004 Median :1.0000
Mean :0.6414 Mean : 5.309 Mean : -8.259 Mean :0.6376
3rd Qu.:0.8540 3rd Qu.: 8.000 3rd Qu.: -5.003 3rd Qu.:1.0000
Max. :1.0000 Max. :11.000 Max. : 4.532 Max. :1.0000
speechiness acousticness instrumentalness liveness
Min. :0.00000 Min. :0.0000 Min. :0.00e+00 Min. :0.0000
1st Qu.:0.03590 1st Qu.:0.0169 1st Qu.:0.00e+00 1st Qu.:0.0980
Median :0.04890 Median :0.1690 Median :4.16e-05 Median :0.1320
Mean :0.08465 Mean :0.3149 Mean :1.56e-01 Mean :0.2136
3rd Qu.:0.08450 3rd Qu.:0.5980 3rd Qu.:4.90e-02 3rd Qu.:0.2730
Max. :0.96500 Max. :0.9960 Max. :1.00e+00 Max. :1.0000
valence tempo time_signature
Min. :0.0000 Min. : 0.00 Min. :0.000
1st Qu.:0.2600 1st Qu.: 99.22 1st Qu.:4.000
Median :0.4640 Median :122.02 Median :4.000
Mean :0.4741 Mean :122.15 Mean :3.904
3rd Qu.:0.6830 3rd Qu.:140.07 3rd Qu.:4.000
Max. :0.9950 Max. :243.37 Max. :5.000
Above is the summary of the numerical variables in the classification dataset, describing their distributions.
Code
Figure 4: Loudness vs. Danceability scatterplot
We do not see a clear, distinct difference between genres. This suggests that these two variables (loudness and danceability) might not contribute much to the genre classification.
Code
Figure 5: Loudness Distribution
The distribution of the loudness variable is left-skewed. Most values are relatively high, with fewer low values pulling the mean towards the left.
Code
Figure 6: Danceability distribution
To us, this looks like a multimodal distribution. This suggests that the dataset contains different genres and each genre has its own relative danceability distribution.
Code
Figure 7:
Although time signature is stored as a number, we should treat it as categorical; it is not a continuous numerical variable.
Code
Figure 8: Tempo Distribution
There are outliers in the tempo variable (e.g., 0 and 200). Most songs lie between 100-150 BPM. This suggests tempo might not be a good candidate for clustering analysis, since most songs are concentrated in a narrow range rather than spread out.
Analysis
Primary Goal (Clustering):
We have determined that the features track_id, duration_ms, and popularity will not be used for clustering, because contextually they don't make sense to use. Track_id has a unique value for each song, so it cannot help in grouping songs. Duration_ms represents the length of the song; length shouldn't be a determining factor in the clusters because it has nothing to do with the sound of the song, and if a user likes how a song sounds, its length will likely not decide whether they like it. Popularity, a numerical value based on the number of plays the song has had on Spotify, was dropped because a song's popularity has nothing to do with its actual musical makeup. These variables were removed from the dataset, and the categorical variables were converted to one-hot encoded dummy variables.
Code
audio <- audio %>%
dplyr::select(-track_id, -duration_ms, -popularity)
categorical_vars <- c("mode", "key", "time_signature")
dummy_data <- dummyVars("~.", data = audio[categorical_vars], fullRank = TRUE) %>%
predict(audio[categorical_vars])
audio_encoded <- cbind(audio %>% select(-all_of(categorical_vars)), dummy_data)
Code
audio_matrix <- audio_encoded %>%
select(-artist_name, -track_name) %>%
as.matrix()
artists <- audio %>%
pull(artist_name)
songs <- audio %>%
pull(track_name)
pc <- prcomp(audio_matrix,
center = TRUE,
scale = TRUE)
new_dims_df <- pc$x %>%
as.data.frame() %>%
bind_cols(artists, songs) %>%
  rename(Artist = `...26`, Song = `...27`)
We performed a principal component analysis as a pre-processing step. The purpose is to reduce the number of features used for clustering by removing extraneous noise in the data while still retaining information from all the initial features. We will perform the clustering on both a subset of the principal components and the original set of features, then decide which method best separates the songs into groups by comparing silhouette scores.
The silhouette score is a metric often used in clustering to help choose the number of clusters. It is essentially a grade for how clear and distinct the clusters are: a high score (closer to 1) means the groups we found are well defined and well separated. To decide on the number of clusters, we clustered the data multiple times with a different number of clusters each time and examined which gave the highest silhouette score; that clustering is the one with the most well-defined, well-separated groups. This wasn't the only criterion, however; another factor was simplicity, since a small number of clusters is easier to understand than a large one. For example, if a small number of clusters leads to a silhouette score only slightly lower than that of a much larger number, the smaller number of clusters is chosen.
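Concretely, for each song i the silhouette width compares its mean distance a(i) to songs in its own cluster against its mean distance b(i) to the nearest other cluster, s(i) = (b(i) - a(i)) / max(a(i), b(i)); the score of a clustering is the mean of these widths. A base-R sketch on toy data (the analysis code itself uses the `silhouette()` function from the cluster package):

```r
# Silhouette width per point: s(i) = (b(i) - a(i)) / max(a(i), b(i))
silhouette_widths <- function(x, labels) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  sapply(seq_len(n), function(i) {
    own <- labels == labels[i]
    a <- mean(d[i, own & seq_len(n) != i])            # mean distance within own cluster
    b <- min(tapply(d[i, !own], labels[!own], mean))  # mean distance to nearest other cluster
    (b - a) / max(a, b)
  })
}

# Two well-separated 1-D clusters: widths should be close to 1
x <- matrix(c(0, 0.1, 0.2, 10, 10.1, 10.2), ncol = 1)
labels <- c(1, 1, 1, 2, 2, 2)
mean(silhouette_widths(x, labels))
```

Overlapping clusters would instead give widths near 0, and points assigned to the wrong group give negative widths.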
Clustering With PCA:
The first 17 principal components account for more than 85% of the variation in the data, so we will use them as the features for clustering. This is a reduction from 25 features to 17; dropping the last 8 principal components removes extraneous noise while the first 17 PCs still carry information from all 25 original features.
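The 85% cutoff is read off the cumulative proportion of variance explained by the PCs; a self-contained sketch of that computation (on a small random stand-in matrix rather than the full song matrix):

```r
# Pick the smallest number of PCs whose cumulative variance share reaches 85%.
set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)          # stand-in for the scaled feature matrix
pc <- prcomp(x, center = TRUE, scale. = TRUE)
var_share <- pc$sdev^2 / sum(pc$sdev^2)        # proportion of variance per PC
cum_share <- cumsum(var_share)
n_pcs <- which(cum_share >= 0.85)[1]
n_pcs
```

On the actual audio data this same computation yields 17 PCs out of 25.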
We will perform clustering on the 17 PCs. To determine how many clusters to use, we ran the clustering algorithm multiple times with different numbers of clusters. The silhouette score is calculated for each, and the number of clusters with the highest silhouette score, also taking simplicity into account, is chosen. The plot can be seen below. It's also worth noting that, due to the size of the dataset (roughly 130,000 songs) and the computational limits of our computers, we used a subset of 10,000 songs to test the different numbers of clusters.
Code
sample_pc <- new_dims_df %>%
sample_n(size = 10000, replace = FALSE)
k_values <- 2:30
silhouette_scores <- numeric(length(k_values))
for (k in k_values) {
kmeans_model <- kmeans(sample_pc[, 1:17], centers = k, algorithm = "MacQueen", iter.max = 1000)
cluster_assignments <- kmeans_model$cluster
silhouette_scores[k - 1] <- mean(silhouette(cluster_assignments, dist(sample_pc[, 1:17]))[ ,3])
}
scree_data <- tibble(k = k_values, silhouette = silhouette_scores)
ggplot(scree_data, aes(x = k, y = silhouette)) +
geom_line(color="green") +
geom_point(color="black") +
labs(title = "Scree Plot with Silhouette Scores Based on PC Clustering",
x = "Number of Clusters (k)",
       y = "Silhouette Score")
In the scree plot, it is apparent that 17 clusters is the optimal choice. This number of clusters gives a high silhouette score while not being very large. Other choices have slightly higher silhouette scores, but those larger numbers of clusters only marginally increase the score. Thus the simpler, smaller number of clusters, with a still comparatively high score, is the optimal choice. Now we fit that optimal number of clusters.
Code
set.seed(27)
audio_km <- kmeans(new_dims_df[ , 1:17], centers = 17, algorithm = "MacQueen", iter.max=1000)
results_pca <- tibble(
clust = audio_km$cluster,
Artist = artists,
Song = songs)
table <- results_pca %>%
count(clust)
kable(table,
format = "html",
col.names=c("Cluster", "Num of Songs"),
align=c("c", "c")
)
| Cluster | Num of Songs |
|---|---|
| 1 | 19617 |
| 2 | 8276 |
| 3 | 1962 |
| 4 | 2553 |
| 5 | 10847 |
| 6 | 675 |
| 7 | 9045 |
| 8 | 3509 |
| 9 | 15376 |
| 10 | 6798 |
| 11 | 7771 |
| 12 | 7757 |
| 13 | 12748 |
| 14 | 7757 |
| 15 | 10500 |
| 16 | 3297 |
| 17 | 2175 |
The plot below displays the clustering results in the dimensions of the first two principal components. These first 2 PCs account for only around 21% of the variation in the data, so the graph will not show distinct separation between the clusters, but it still gives us some visual sense of how the clusters are formed.
Code
Clustering With Original Features:
We perform clustering on the original 25 features (this includes the dummy variables). To determine the optimal number of clusters we follow the same steps as for the clustering on the PCs: the silhouette score is calculated for each iteration with a different number of clusters, and the number of clusters with the highest silhouette score, also taking simplicity into account, is chosen. The plot can be seen below. Again, due to the size of the dataset (roughly 130,000 songs) and the computational limits of our computers, we used a subset of 10,000 songs to test the different numbers of clusters.
Code
audio_matrix_normalized <- as.data.frame(audio_matrix) %>%
mutate_if(is.numeric, scale)
sample_full <- audio_matrix_normalized %>%
sample_n(size = 10000, replace = FALSE)
silhouette_scores <- numeric(length(k_values))
for (k in k_values) {
kmeans_model <- kmeans(sample_full, centers = k, algorithm = "MacQueen", iter.max = 1000)
cluster_assignments <- kmeans_model$cluster
silhouette_scores[k - 1] <- mean(silhouette(cluster_assignments, dist(sample_full))[ ,3])
}
scree_data <- tibble(k = k_values, silhouette = silhouette_scores)
ggplot(scree_data, aes(x = k, y = silhouette)) +
geom_line(color="green") +
geom_point(color="black") +
labs(title = "Scree Plot with Silhouette Scores Based on Scaled Features",
x = "Number of Clusters (k)",
       y = "Silhouette Score")
In the scree plot, it is apparent that 13 clusters is the optimal choice. This number of clusters gives a high silhouette score while not being very large. Other choices have slightly higher silhouette scores, but those larger numbers of clusters only marginally increase the score. Thus the smaller number of clusters, with a still comparatively high score, is the optimal choice. Now we fit that optimal number of clusters.
Code
set.seed(27)
audio_km <- kmeans(audio_matrix_normalized, centers = 13, algorithm = "MacQueen", iter.max=1000)
results <- tibble(
clust = audio_km$cluster,
Artist = artists,
Song = songs)
table <- results %>%
count(clust)
kable(table,
format = "html",
col.names=c("Cluster", "Num of Songs"),
align=c("c", "c")
)
| Cluster | Num of Songs |
|---|---|
| 1 | 38845 |
| 2 | 8947 |
| 3 | 2327 |
| 4 | 3416 |
| 5 | 12273 |
| 6 | 809 |
| 7 | 14471 |
| 8 | 3506 |
| 9 | 8389 |
| 10 | 6971 |
| 11 | 8547 |
| 12 | 8368 |
| 13 | 13794 |
Code
So, which method is better: PCA or using all features in their original state?
Examining the silhouette scores from the previous parts, the best PCA clustering scores around 0.3 while the best clustering on all the original features scores around 0.225, a sizeable difference. The noise removal from PCA, along with the principal components being uncorrelated, further favours this method. Therefore the PCA clustering is better and will be used.
Characterizing the Clusters:
To delve into the specific characteristics of the songs within each cluster, we look at the average value of each numerical feature within each cluster; for the categorical features, we compare the distribution of levels within each cluster against the overall distribution in the dataset. When computing the cluster averages of the numerical features, we standardized the values so they all fall on the same scale, essentially creating a z-score for each numerical feature within each cluster. This draws a clear picture of which features (song characteristics) make each cluster unique.
Code
for_graph <- audio %>%
select(acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence)
for_graph <- for_graph %>%
mutate_all(~scale(.))
new_column_names <- c("acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence")
colnames(for_graph) <- new_column_names
for_graph$cluster = results_pca$clust
graph <- for_graph %>%
group_by(cluster) %>%
summarise(across(where(is.numeric), ~mean(.)))
long <- tidyr::gather(graph, key = "feature", value = "value", -cluster)
long$feature <- factor(long$feature)
The three following tables display the distribution of the categorical variables key, mode, and time signature within each cluster; the last row in each table contains the overall distribution in the dataset.
Code
for_table <- audio %>%
select(key, mode, time_signature)
for_table$cluster = as.factor(results_pca$clust)
two_way_table <- table(for_table$cluster, for_table$key)
normalized_table <- prop.table(two_way_table, margin=1)
rounded_table <- round(normalized_table, 3)
colnames(rounded_table) <- paste("Key =", colnames(normalized_table))
rownames(rounded_table) <- paste("Cluster =", rownames(normalized_table))
overall <- for_table %>%
group_by(key) %>%
summarize(Proportion = n() / nrow(for_table))
overall$Proportion = round(overall$Proportion, 3)
extended_table <- rbind(rounded_table, overall$Proportion)
kable(extended_table, caption = "Proportion of Key Values Within Each Cluster")
| | Key = 0 | Key = 1 | Key = 2 | Key = 3 | Key = 4 | Key = 5 | Key = 6 | Key = 7 | Key = 8 | Key = 9 | Key = 10 | Key = 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster = 1 | 0.116 | 0.096 | 0.103 | 0.036 | 0.069 | 0.091 | 0.024 | 0.132 | 0.078 | 0.104 | 0.074 | 0.077 |
| Cluster = 2 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Cluster = 3 | 0.378 | 0.000 | 0.336 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.286 | 0.000 | 0.000 | 0.000 |
| Cluster = 4 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Cluster = 5 | 0.112 | 0.122 | 0.097 | 0.040 | 0.068 | 0.078 | 0.064 | 0.097 | 0.076 | 0.085 | 0.084 | 0.078 |
| Cluster = 6 | 0.244 | 0.092 | 0.155 | 0.099 | 0.042 | 0.070 | 0.000 | 0.106 | 0.074 | 0.074 | 0.000 | 0.043 |
| Cluster = 7 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Cluster = 8 | 0.138 | 0.092 | 0.123 | 0.067 | 0.080 | 0.095 | 0.000 | 0.115 | 0.064 | 0.095 | 0.075 | 0.058 |
| Cluster = 9 | 0.024 | 0.000 | 0.006 | 0.001 | 0.000 | 0.000 | 0.967 | 0.001 | 0.001 | 0.000 | 0.000 | 0.000 |
| Cluster = 10 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Cluster = 11 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Cluster = 12 | 0.287 | 0.092 | 0.188 | 0.038 | 0.027 | 0.037 | 0.017 | 0.105 | 0.077 | 0.056 | 0.013 | 0.063 |
| Cluster = 13 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 |
| Cluster = 14 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 |
| Cluster = 15 | 0.000 | 0.000 | 0.000 | 0.218 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.782 |
| Cluster = 16 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 |
| Cluster = 17 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Overall | 0.115 | 0.117 | 0.095 | 0.033 | 0.069 | 0.083 | 0.068 | 0.107 | 0.071 | 0.092 | 0.071 | 0.079 |
Code
two_way_table <- table(for_table$cluster, for_table$mode)
normalized_table <- prop.table(two_way_table, margin=1)
rounded_table <- round(normalized_table, 3)
colnames(rounded_table) <- paste("Mode =", colnames(normalized_table))
rownames(rounded_table) <- paste("Cluster =", rownames(normalized_table))
overall <- for_table %>%
group_by(mode) %>%
summarize(Proportion = n() / nrow(for_table))
overall$Proportion = round(overall$Proportion, 3)
extended_table <- rbind(rounded_table, overall$Proportion)
kable(extended_table, caption = "Proportion of Mode Values Within Each Cluster")
| | Mode = 0 | Mode = 1 |
|---|---|---|
| Cluster = 1 | 0.296 | 0.704 |
| Cluster = 2 | 0.289 | 0.711 |
| Cluster = 3 | 0.251 | 0.749 |
| Cluster = 4 | 0.268 | 0.732 |
| Cluster = 5 | 0.383 | 0.617 |
| Cluster = 6 | 0.311 | 0.689 |
| Cluster = 7 | 0.603 | 0.397 |
| Cluster = 8 | 0.326 | 0.674 |
| Cluster = 9 | 0.624 | 0.376 |
| Cluster = 10 | 0.529 | 0.471 |
| Cluster = 11 | 0.397 | 0.603 |
| Cluster = 12 | 0.295 | 0.705 |
| Cluster = 13 | 0.626 | 0.374 |
| Cluster = 14 | 0.409 | 0.591 |
| Cluster = 15 | 0.600 | 0.400 |
| Cluster = 16 | 0.476 | 0.524 |
| Cluster = 17 | 0.513 | 0.487 |
| Overall | 0.392 | 0.608 |
Code
two_way_table <- table(for_table$cluster, for_table$time_signature)
normalized_table <- prop.table(two_way_table, margin=1)
rounded_table <- round(normalized_table, 3)
colnames(rounded_table) <- paste("Time Signature =", colnames(normalized_table))
rownames(rounded_table) <- paste("Cluster =", rownames(normalized_table))
overall <- for_table %>%
group_by(time_signature) %>%
summarize(Proportion = n() / nrow(for_table))
overall$Proportion = round(overall$Proportion, 3)
extended_table <- rbind(rounded_table, overall$Proportion)
kable(extended_table, caption = "Proportion of Time Signature Values Within Each Cluster")
| | Time Signature = 0 | Time Signature = 1 | Time Signature = 3 | Time Signature = 4 | Time Signature = 5 |
|---|---|---|---|---|---|
| Cluster = 1 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 |
| Cluster = 2 | 0.000 | 0.006 | 0.000 | 0.994 | 0.000 |
| Cluster = 3 | 0.000 | 0.005 | 0.000 | 0.995 | 0.000 |
| Cluster = 4 | 0.000 | 0.009 | 0.000 | 0.991 | 0.000 |
| Cluster = 5 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Cluster = 6 | 0.017 | 0.072 | 0.000 | 0.911 | 0.000 |
| Cluster = 7 | 0.000 | 0.010 | 0.001 | 0.989 | 0.000 |
| Cluster = 8 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 |
| Cluster = 9 | 0.000 | 0.019 | 0.051 | 0.931 | 0.000 |
| Cluster = 10 | 0.000 | 0.008 | 0.000 | 0.991 | 0.000 |
| Cluster = 11 | 0.014 | 0.053 | 0.313 | 0.620 | 0.000 |
| Cluster = 12 | 0.017 | 0.043 | 0.029 | 0.912 | 0.000 |
| Cluster = 13 | 0.000 | 0.005 | 0.001 | 0.994 | 0.000 |
| Cluster = 14 | 0.016 | 0.050 | 0.027 | 0.907 | 0.000 |
| Cluster = 15 | 0.000 | 0.007 | 0.000 | 0.993 | 0.000 |
| Cluster = 16 | 0.000 | 0.007 | 0.000 | 0.993 | 0.000 |
| Cluster = 17 | 0.000 | 0.001 | 0.003 | 0.997 | 0.000 |
| Overall | 0.002 | 0.013 | 0.097 | 0.862 | 0.025 |
The following graphs show the standardized scores for each numerical feature, with the scores for the cluster we are isolating highlighted. This allows us to identify which song characteristics stand out for the songs within each cluster.
Cluster 1: Fast Pace, Key 0
Using the graph below, this cluster is defined by high levels of valence and tempo, and low values of acousticness and instrumentalness. This cluster also has around 50% of its songs in key 0, while only about 10% of all songs are in key 0. So, songs in this cluster are likely to be faster paced with little acoustic instrumentation (more synthesizers used).
Code
highlight_cluster <- 1
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 1",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 2: High Tempo, Loudness, and Energy, Key 2, Major Mode
This cluster has very high values for loudness, tempo, and energy. Very low values for acousticness and instrumentalness. High proportion of major mode songs compared to the entire dataset and all songs are in key 2.
Code
highlight_cluster <- 2
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 2",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 3: Low Tempo and Valence, Fairly High Acousticness, Key 10, Time Signature 3
This cluster is the opposite of the previous ones: it has higher values for acousticness and instrumentalness and lower values for the features having to do with the loudness and pace of the songs. It also has a high proportion of songs in time signature 3 and key 10.
Code
highlight_cluster <- 3
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 3",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 4: Slow Tempo, High Acousticness, Key 2
Same as cluster 3, but now all the songs are in key 2 instead of key 10.
Code
highlight_cluster <- 4
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 4",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 5: Fairly Balanced, Moderate Pace, Key 7, Major Mode
This group of songs has fairly average values for the numerical features, but it is definitely on the less acoustic side, with a moderate pace. All songs are in key 7, and a high proportion are in the major mode compared to the rest of the dataset.
Code
highlight_cluster <- 5
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 5",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 6: Balanced, Major Mode, Time Signature 3, Key 3
This cluster is even more balanced than the previous cluster, with no extreme values for the numerical features. Almost all of the songs are in time signature 3, which is rare. All the songs are in key 3, and a very high proportion are in the major mode.
Code
highlight_cluster <- 6
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 6",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 7: Extreme Acousticness and Instrumentalness, Slow Pace, Balanced Keys
This cluster is defined by extremes. It has the highest values for acousticness and instrumentalness, and the lowest values for tempo, valence, danceability, etc., along with a very balanced distribution of keys. Songs in this group are likely to have no vocals, perhaps classical music.
Code
highlight_cluster <- 7
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 7",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 8: Balanced, Time Signature 4, Key 3, Minor Mode
Again, we have a very balanced cluster in terms of the numerical features. All the songs are in time signature 4 and key 3, and a high proportion are in the minor mode.
Code
highlight_cluster <- 8
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 8",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 9: Fast Paced, Not a Lot of Instruments, Key 4 and 11, Minor Mode
This cluster returns to the earlier trend: fast paced with low instrumentalness/acousticness. Half of the songs are in key 4 and half in key 11, and a very high proportion are in the minor mode.
Code
highlight_cluster <- 9
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 9",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 10: Very Fast Paced and Danceable, Minor Mode, Key 10
Very high danceability, valence, and tempo. Low instrumentalness and acousticness. Very high proportion of songs in minor mode, and all songs in key 10.
Code
highlight_cluster <- 10
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 10",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 11: Fairly High Pace, Key 8
This cluster also follows the earlier trend: fairly fast paced with low instrumentalness/acousticness. All songs are in key 8.
Code
highlight_cluster <- 11
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 11",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 12: Fairly High Tempo, Key 6, Minor Mode
Very similar to cluster 11, except slightly less extreme in all numerical features. All songs are in key 6, and songs tend to be in the minor mode.
Code
highlight_cluster <- 12
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 12",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 13: Very High Speechiness, Key 1
Very high danceability, tempo, and speechiness. Very low values of acousticness. All of the songs are in key 1.
Code
highlight_cluster <- 13
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 13",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 14: Relatively Fast Paced, Mixed Time Signatures, Minor Mode
Back to the earlier trend: faster paced songs with low instrumentalness. All songs are in key 5, and a very high proportion are in the minor mode compared to the rest of the dataset.
Code
highlight_cluster <- 14
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 14",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 15: Balanced, Time Signature 3
Middle of the road in almost every numerical category. All songs share the same time signature, which is very rare, and the distribution of keys is very dispersed.
Code
highlight_cluster <- 15
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 15",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 16: Extreme Speechiness and Liveness, Time Signature 5
Extreme values of speechiness and liveness, which suggests these tracks may be spoken-word pieces or even podcasts. All songs are in time signature 5.
Code
highlight_cluster <- 16
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 16",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Cluster 17: Slow Pace, High Instrumentalness, Key 5
This cluster returns to the slower-paced, high instrumentalness/acousticness profile. All songs are in key 5.
Code
highlight_cluster <- 17
ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.5) +
geom_point(data = filter(long, cluster == highlight_cluster),
aes(x = feature, y = value),
color = "black", size = 1.5) +
labs(title = "Visualising Cluster Makeup",
subtitle = "Cluster 17",
y = "Cluster Standardized Center") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.subtitle = element_text(hjust = 0.5), # Center the subtitle
plot.margin = margin(20, 20, 20, 20)) +
scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
guides(color = "none")
Secondary Goal (Genre Prediction):
Now we will use a second dataset, which contains all the same features as the clustering dataset but with different songs. This dataset also has the song genre as a column; this is the variable we are trying to predict. Below, some data pre-processing is done: converting the categorical variables to factors and dropping columns that are unique to each song, as they will not help in predicting the genre. It is worth noting that the original dataset contained 114 different genres; we regrouped these into 13 more general genres. Some of the genres were very niche, and many people using the shiny app would not understand some of the output genres, so the regrouping is for ease of use.
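The 114-to-13 regrouping can be expressed with `forcats::fct_collapse()`. The sketch below is purely illustrative: the report's actual group names and assignments are not shown, so the genre labels here are examples, not the real mapping.

```r
library(dplyr)
library(forcats)

# Illustrative regrouping sketch. The real 114-to-13 mapping is not
# reproduced here; the groups and member genres below are examples only.
genre <- genre %>%
  mutate(track_genre = fct_collapse(factor(track_genre),
    electronic = c("edm", "house", "techno", "trance"),
    rock       = c("alt-rock", "hard-rock", "punk-rock"),
    # ... remaining general genres ...
    other_level = "other"  # anything not listed falls into "other"
  ))
```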
The model specification we are going with is a random forest. This is a good choice for our data because we have a decently large number of predictors, and we are not entirely sure which will be the most important in predicting genre. A random forest uses a random subset of the predictors for each tree, so we do not have to choose which predictors to include in the model despite the lack of context we have on these predictors.
Below we take a random stratified sample of the entire dataset to perform hyperparameter tuning. This is needed because there is a large number of songs in the dataset (114,000), and we lack the computational power to perform tuning on a dataset that large. This reduction allows us to tune the model while still giving it a large sample. Once the hyperparameters are chosen, the model will be fit on the full training data.
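The stratified subsample and the resamples used by `tune_grid()` can be drawn as follows. This is a sketch, since the sampling chunk is not shown: the 10% fraction and the object name `genre_sample` are assumptions, while `genre`, `track_genre`, and `genre_cvs` match the names used elsewhere in this report.

```r
library(dplyr)
library(rsample)

set.seed(27)

# Sample the same fraction of songs from every genre, so the subsample
# preserves the genre proportions of the full dataset (10% assumed here).
genre_sample <- genre %>%
  group_by(track_genre) %>%
  slice_sample(prop = 0.10) %>%
  ungroup()

# Stratified 5-fold cross-validation on the subsample, for tuning.
genre_cvs <- vfold_cv(genre_sample, v = 5, strata = track_genre)
```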
Code
set.seed(27)

rf_spec <- rand_forest(mtry = tune(),
                       min_n = tune(),
                       trees = 10) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_grid <- grid_regular(mtry(c(3, 12)),
                        min_n(),
                        levels = 3)

rf_wflow <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(genre_recipe)

rf_grid_search <- tune_grid(rf_wflow,
                            resamples = genre_cvs,
                            grid = rf_grid)
tuning_metrics <- as.data.frame(rf_grid_search %>% collect_metrics())
tuning_metrics <- filter(tuning_metrics, .metric == "accuracy")

kable(tuning_metrics %>%
        slice_max(mean) %>%
        select(mtry, min_n, .metric, mean),
      col.names = c("Mtry", "Min n", "Metric", "Value"),
      align = c("c", "c", "c", "c"))
| Mtry | Min n | Metric | Value |
|---|---|---|---|
| 3 | 21 | accuracy | 0.355231 |
Code
| Mtry | Min n | Metric | Value |
|---|---|---|---|
| 3 | 2 | accuracy | 0.3372311 |
| 7 | 2 | accuracy | 0.3364620 |
| 12 | 2 | accuracy | 0.3404621 |
| 3 | 21 | accuracy | 0.3552310 |
| 7 | 21 | accuracy | 0.3488468 |
| 12 | 21 | accuracy | 0.3453084 |
| 3 | 40 | accuracy | 0.3433080 |
| 7 | 40 | accuracy | 0.3507696 |
| 12 | 40 | accuracy | 0.3474619 |
We see that the model with the highest accuracy has an mtry of 3 and a min_n of 21. This also happens to be the second-simplest model, so it is an easy choice to move forward with. We then fit the model with these hyperparameters to the whole dataset.
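The `rf_fit` object used in the next chunk can be produced by finalizing the tuned workflow with the selected hyperparameters. The chunk that creates it is not shown, so this is a sketch of one plausible construction using the tidymodels helpers.

```r
library(tune)

# Pull the best hyperparameter combination (mtry = 3, min_n = 21)
# from the grid search and plug it into the workflow.
best_params <- select_best(rf_grid_search, metric = "accuracy")
rf_fit <- finalize_workflow(rf_wflow, best_params)
```

Because `rf_fit` is a finalized (but unfitted) workflow, it can be passed directly to `fit_resamples()` below.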
Code
# detaching the caret package because it leads to a namespace error in the code below
detach("package:caret", unload = TRUE)
full_genre_cvs <- vfold_cv(genre, v = 5, strata=track_genre)
table <- rf_fit %>%
fit_resamples(full_genre_cvs,
metrics = metric_set(accuracy)) %>%
collect_metrics()
kable(table %>%
select(.metric, mean),
col.names=c("Metric", "Value"),
align=c("c", "c")
)
| Metric | Value |
|---|---|
| accuracy | 0.4758771 |
From the results we see that the accuracy is fairly good: the model has a cross-validated accuracy of around 0.48. This means that when predicting song genres using our model, we should expect it to predict the correct genre around 48% of the time. For context, guessing uniformly at random among the 13 genres would be correct only about 8% of the time, so the model performs substantially better than chance.
Ethical Implications / Limitations
Regarding the ethics of this analysis, one big issue is the scope of the dataset used for the recommendation algorithm. Not all artists are contained in the dataset, so artists not present in it will never be recommended to the user. The dataset contains a random selection of 130,000 songs that were present on Spotify as of April 2019, meaning there are no songs from May 2019 onward. This exclusion of certain artists is a problem. Since the songs were collected randomly, artists with more songs on Spotify were more likely to be included in the dataset, so not all artists are given a fair shot. Smaller artists will likely have fewer songs in the dataset, giving them a lower chance of being recommended, while more popular artists will likely have more songs and be recommended more often. So, this algorithm in its current state may reinforce the status quo of the music industry by keeping popular artists popular and barring less popular artists from breaking through.
The other main issue is the timeframe of the songs used for the analysis. We are currently in 2023, and the dataset does not contain any songs from the past four years, so our algorithm will not recommend any of these newer songs. This presents a fairly big issue, as users will likely want some contemporary music in their recommended playlists, and the analysis in its current state does not provide this.